Classical Arabic Poetry Categorization Using N-gram Frequency Statistics
Abstract
Most Arabic vocabulary is derived from roots, which are words composed of three to five consonants. Any Arabic information retrieval task must first deal with the language's morphological and structural changes (the stemming process) before a statistical method for extracting information is applied. This paper presents a method for categorizing Classical Arabic Poetry (CAP) into its categories (Ghazal, Medeh, Wasef, Hijaa', etc.) by combining a light stemmer algorithm (which identifies sets of prefixes and suffixes in an Arabic word and removes them to reach the word root) with the N-gram statistical method (which retrieves information independently of the language's complexity). Two measures are implemented for the purpose of categorization: the Manhattan distance dissimilarity coefficient and Dice's measure similarity coefficient.
Iqbal Abdul Baqi Mohammad
College of Medical and Health Technologies, Foundation of Technical Education, Baghdad, Iraq.
Mohammad, Iraqi Journal of Science, Vol.51, No.1, 2010, PP. 159-165
Introduction
Arabic is a complex and rich language by nature, which makes any development of text categorization systems a challenging task. Arabic poses many problems, such as variant spellings of certain words, irregular and inflected derived forms, short diacritics and long vowels, and the fact that most of its words contain affixes. It has 28 letters and is written from right to left. It has a very complex morphology in which most words have a tri-letter root and the rest have a quad-, penta-, or hexa-letter root. Classical Arabic Poetry (CAP) is written in a specific form: it consists of verses, each known as a bayt (plural abyat), and each verse is divided into two halves, each known as a shater (dual shatrayn). Text categorization is the process of structuring a set of poems according to a group structure that is known in advance [1].
The aim of this paper is to propose a preliminary categorization of CAP using a two-step method. The first step is a stemmer, which requires specific knowledge about the language [2]. The second is the N-gram method, a statistical approach to categorization. Stemming reduces variant word forms to common roots and thereby improves the ability of the system to match query and vocabulary [3]. It makes the text compact and easy to process. N-gram frequency profiles provide a simple and reliable way to categorize documents in a wide range of categorization tasks. Whether two words are semantically similar or dissimilar can be determined from the character structure of these words [4].
Many research works have used the N-gram method in categorizing Arabic text. [5] developed the first automatic classification technique based on the character structure of words: Dice's similarity coefficient is computed from the number of matching bi-grams (2-grams) in pairs of character strings and is used to cluster sets of character strings. [6] used tri-grams for indexing Arabic documents without any prior stemming. [7] used N-grams with and without stemming for text searching; their results indicated that tri-grams combined with stemming improved retrieval performance. [8] assessed the performance of two N-gram matching techniques for Arabic root-driven string searching. [9] used N-grams for searching Arabic text documents, investigating di-grams and tri-grams without stemming, and concluded that the N-gram technique is not an efficient approach to corpus-based Arabic word conflation. [1] used the N-gram frequency statistics technique for classifying Arabic text documents. For each document to be classified, the N-gram frequency profile was generated and compared against the N-gram frequency profiles of all the training classes. The comparison was done by calculating the Manhattan distance and Dice's measure.
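The profile comparison described for [1] can be illustrated with a minimal sketch. It assumes that a profile holds relative tri-gram frequencies, that the Manhattan distance is the sum of absolute frequency differences, and that Dice's measure is taken over the sets of distinct N-grams. These are common definitions, not necessarily the exact ones used in [1] or in this paper. The names `profile`, `manhattan`, `dice`, and `classify` are hypothetical.

```python
from collections import Counter

def profile(text, n=3):
    """Relative frequency of each character n-gram in text (assumed normalization)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}

def manhattan(p, q):
    """Sum of absolute frequency differences; a smaller value means more alike."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))

def dice(p, q):
    """2 * shared n-grams / (n-grams in p + n-grams in q); larger means more alike."""
    if not p and not q:
        return 0.0
    return 2 * len(set(p) & set(q)) / (len(p) + len(q))

def classify(doc, class_profiles, n=3):
    """Return the best class under each measure: (by Manhattan, by Dice)."""
    doc_p = profile(doc, n)
    by_manhattan = min(class_profiles, key=lambda c: manhattan(doc_p, class_profiles[c]))
    by_dice = max(class_profiles, key=lambda c: dice(doc_p, class_profiles[c]))
    return by_manhattan, by_dice
```

In use, each category (for example Ghazal or Medeh) would have one profile built from its stemmed training poems, and an unseen poem would be assigned to the category with the smallest Manhattan distance or the largest Dice value.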
[10] presented the N-gram model, which can be used to compute the similarity between two strings by counting the number of N-grams they share. [11] presented an approach that uses N-grams based on words and characters. Four basic types were explored, either separately or combined: word, lexical root, root, and N-gram.
This paper is organized as follows: the stemming algorithm is discussed in section two, and the N-gram concept is explained in section three. A detailed description of the whole system is given in section four. Finally, the results are presented, followed by the conclusion.
Stemmer
A stemmer is an automatic process used to reduce the different morphological forms of words to a common root (stem) in order to improve the performance of the extraction system [4]. The light stemmer approach presented by A. Chen and F. Gey [12] is used here. They applied the following rules:
If the word is at least five characters long, remove the first three characters if they are one of the following: ،للا، ، ك لاّو ،للاا ،لاا ،للاس ،للا ،لاو للب ،للف. (for example, ِةلأبك becomes ةأ).
If the word is at least four characters long, remove the first two characters if they are one of the following: ، ،ٌا، ، و ،ًو ،لو ،باك ،باف ،لا ،او بب ،مّن ،وو ،ثو ،ةو (for example, ٌ أَ، becomes ٌ أ).
If the word is at least four characters long and begins with و, remove the initial letter و (for example, لبيو becomes لبي).
If the word is at least four characters long and begins with either ب or ل, remove ب or ل (for example, ىخَن becomes ىخٍ).
Recursively strip the following two-character suffixes, in the order of presentation, if the word is at least four characters long before removing a suffix: ،او ،ٌَ ،بٍ ،ٍه ،ىك ،ٍك ،ىح ،ٍح ،ٍٍ ،ٌا ،ثا ،ٌو به ،تٍ ،ىه ،بَ ،بي (for example, به دش becomes دش ).
Recursively strip the following one-character suffixes, in the order of presentation, if the word is at least three characters long before removing a suffix: ة ،ِ ،ً ،ث. (for example, ْجُّح becomes ٍ ح).
N-grams
An N-gram is a sub-sequence of N items from a given sequence, where the items here are the characters of a word. It is an N-character slice of a longer string. Blanks are added before and after a word, and these blanks (shown here as underscores) take part in its N-grams. The value of N can be chosen for a particular corpus. The word ةوخكي can be composed of the following N-grams:
Bi-grams: _ة ، ةو ، وح ، ـ خك ، ـ كي ، ـ ي_
Tri-grams: خكي ، ـ كي_ __ة ، _ةو ، ةوح ، وخك ، ـ
Quad-grams: ___ة ، __ةو ، _ةوح ، ةوخك ، وخكي ، ـ خكي_
The advantages of the N-gram method are that it does not require preliminary knowledge of the language, does not require predefined rules, and does not require the construction of a vocabulary database [4]. Arabic nouns and verbs are heavily prefixed and suffixed, and as a result words of different lengths can share the same principal concept. Two words are considered similar if they have several substrings of N characters in common; this is determined by calculating a coefficient on the two words. The following bi-gram example shows the similarity between the two words ٌوثرخكًنا ، درخكي :
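Separately from the paper's own example, the padding scheme and the word-level coefficient can be illustrated with a short sketch. It assumes N-1 leading blanks and one trailing blank, which mirrors the pattern visible in the N-gram lists above, and it assumes that Dice's coefficient is computed over the sets of distinct N-grams. The function names `char_ngrams` and `dice_words` and the sample words are hypothetical.

```python
def char_ngrams(word, n):
    """N-grams of a word padded with n-1 leading blanks and one trailing blank ('_')."""
    padded = "_" * (n - 1) + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def dice_words(a, b, n=2):
    """Dice coefficient over the sets of distinct n-grams of two words."""
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# A five-letter word yields six bi-grams under this padding.
print(char_ngrams("مكتوب", 2))
# Two words from the same root share most of their bi-grams.
print(dice_words("مكتب", "مكتوب"))
```

Under these assumptions the two sample words share four of their bi-grams, out of five and six respectively, giving a coefficient of 8/11.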
Similar resources
Text Categorization Using n-Gram Based Language Independent Technique
This paper presents a language and topic independent, byte-level n-gram technique for topic-based text categorization. The technique relies on an n-gram frequency statistics method for document representation, and a variant of the k nearest neighbors machine learning algorithm for the categorization process. It does not require any morphological analysis of texts, any preprocessing steps, or any prior i...
Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study
This paper presents the results of classifying Arabic text documents using the N-gram frequency statistics technique employing a dissimilarity measure called the “Manhattan distance”, and Dice’s measure of similarity. The Dice measure was used for comparison purposes. Results show that N-gram text classification using the Dice measure outperforms classification using the Manhattan measure.
Serbian Text Categorization Using Byte Level n-Grams
This paper presents the results of classifying Serbian text documents using the byte-level n-gram based frequency statistics technique, employing four different dissimilarity measures. Results show that the byte-level n-grams text categorization, although very simple and language independent, achieves very good accuracy.
Language Identification from Text Using N-gram Based Cumulative Frequency Addition
This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-...
A Comparison of Text-Categorization Methods Applied to N-Gram Frequency Statistics
This paper gives an analysis of multi-class e-mail categorization performance, comparing a character n-gram document representation against a word-frequency based representation. Furthermore the impact of using available e-mail specific meta-information on classification performance is explored and the findings are presented.